load("~/Documents/Nonparametric Statisics/Project/clean data/functional/df_week.RData")
library(roahd)       # fData, fbplot, outliergram
library(fdacluster)  # fdakmeans, fdahclust, compare_caps
library(tidyverse)
First, we perform outlier detection on the functional data.
fd_week <- fData(1:7, as_tibble(df_week[,3:9]))
plot(fd_week, lwd = 3, xlab = "day", ylab = "daily number of crashes", main = "daily crashes in each week")
Functional bagplot:
week_fbagplot <- fbplot(fd_week, main="Magnitude outliers weekly data")
The default F value is:
week_fbagplot$Fvalue
[1] 1.5
No outliers are found.
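The F value is the inflation factor applied to the 50% central region: curves escaping the inflated fences are flagged as magnitude outliers. As a quick sensitivity check we could also pass a custom factor explicitly (a sketch, assuming roahd's fbplot and its Fvalue argument; the variable name is hypothetical):

```r
# Hypothetical sensitivity check: a smaller inflation factor flags more curves.
week_fbagplot_strict <- fbplot(fd_week, Fvalue = 1, main = "F = 1")
week_fbagplot_strict$ID_outliers  # indices of the flagged weeks, if any
```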
We remove the weeks that cause problems with zeros (weeks 1, 52 and 53):
df_week2 <- df_week %>% filter(week_of_year %in% 2:51)
fd_week2 <- fData(1:7,as_tibble(df_week2[,3:9]))
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
adjust = list( N_trials = 20,trial_size = fd_week2$N,
VERBOSE = TRUE ))
* * * Iteration 1 / 20
* * * * beginning optimization
* * * * optimization finished.
[... iterations 2-20 produce identical messages and are omitted ...]
The chosen F value is:
week_fbagplot2$Fvalue
[1] 0.6115131
The outlying weeks are:
df_week2[week_fbagplot2$ID_outliers,1:2]
They come from the first years and the COVID year.
Outliergram:
invisible(out_week <- outliergram(fd_week, adjust = FALSE, lwd = 3, display = FALSE))
The detected outliers are:
df_week[out_week$ID_outliers,1:2]
The built-in plot of the original curves does not work here, so we plot them manually below.
Adjusting the F value:
out_week <- outliergram(fd_week,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_week$N,
VERBOSE = TRUE ),display = FALSE)
* * * Iteration 1 / 20
* * * * beginning optimization
[... iterations 2-20 produce identical messages and are omitted ...]
out_week$Fvalue
[1] 2.373441
Nothing changed; the same outliers are detected.
df_week[out_week$ID_outliers,1:2]
Plotting outliers and non-outliers side by side:
par(mfrow=c(1,2))
plot(fd_week[out_week$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week[-out_week$ID_outliers,],lwd = 1,main = "non outliers",col = 3)
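The outliergram flags shape outliers from each curve's (MEI, MBD) pair: pairs falling more than F below the limiting parabola are outliers. The two depths can also be inspected directly (a sketch, assuming roahd's MBD() and MEI() accept the raw value matrix):

```r
vals <- as.matrix(df_week[, 3:9])
mbd <- MBD(vals)  # modified band depth
mei <- MEI(vals)  # modified epigraph index
plot(mei, mbd, pch = 19, xlab = "MEI", ylab = "MBD",
     main = "outliergram coordinates")
```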
Repeating the analysis on the data without the zero weeks:
week2_fbagplot <- fbplot(fd_week2, main="Magnitude outliers weekly data")
The default F value is:
week2_fbagplot$Fvalue
[1] 1.5
No outliers are found.
df_week2[week2_fbagplot$ID_outliers,1:2]
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
adjust = list( N_trials = 20,trial_size = fd_week2$N,
VERBOSE = TRUE ))
* * * Iteration 1 / 20
* * * * beginning optimization
* * * * optimization finished.
[... iterations 2-20 produce identical messages and are omitted ...]
The chosen F value is:
week_fbagplot2$Fvalue
[1] 0.631645
The outlying weeks are:
df_week2[week_fbagplot2$ID_outliers,1:2]
Again, the first years and the COVID year.
Outliergram:
invisible(out_week2 <- outliergram(fd_week2, adjust = FALSE, lwd = 3, display = FALSE))
The detected outliers are:
df_week2[out_week2$ID_outliers,1:2]
The built-in plot of the original curves does not work here, so we plot them manually below.
Adjusting the F value:
out_week2 <- outliergram(fd_week2, lwd = 3, adjust = list( N_trials = 20, trial_size = 8*fd_week2$N,
VERBOSE = TRUE ), display = FALSE)
* * * Iteration 1 / 20
* * * * beginning optimization
[... iterations 2-20 produce identical messages and are omitted ...]
out_week2$Fvalue
[1] 2.353695
Nothing changed; the same outliers are detected.
df_week2[out_week2$ID_outliers,1:2]
Plotting outliers and non-outliers side by side:
par(mfrow=c(1,2))
plot(fd_week2[out_week2$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week2[-out_week2$ID_outliers,],lwd = 1,main = "non outliers",col = 3)
We can now do some clustering; there is no need to align the data this time:
days <- 1:7
n <- fd_week$N
x <- t(matrix(rep(days,n),7,n))
y <- as.matrix(df_week[,3:9])
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
seeds = sample(1:n,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 943
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 3
- Initial seeds for cluster centers: 15 379 723
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 417
* Size of cluster #1: 312
* Size of cluster #2: 214
[... iterations 2-15 omitted ...]
- Iteration #16
* Size of cluster #0: 325
* Size of cluster #1: 266
* Size of cluster #2: 352
Active stopping criteria:
- The total dissimilarity did not decrease.
user system elapsed
2.482 0.130 2.584
autoplot(fdakma0der,type = "amplitude")
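Note that the initial centroids come from sample(), so the partition can change between runs; fixing the RNG seed makes a run reproducible (a sketch; the seed value and variable name are arbitrary):

```r
set.seed(2023)  # hypothetical seed, chosen only for reproducibility
fdakma_rep <- fdakmeans(x = x, y = y, n_clusters = 3,
                        seeds = sample(1:n, 3),
                        warping_class = "none", metric = "pearson",
                        centroid_type = "mean")
```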
Selecting the number of clusters on a subsample:
n_sub <- 50
sub_id <- sample(1:n,n_sub,replace = FALSE)
x_sub <- x[sub_id,]
y_sub <- y[sub_id,]
system.time(invisible(comparison_kmeans <- compare_caps(
x_sub,
y_sub,
n_clusters = 2:5,
metric = "pearson",
clustering_method = "kmeans",
warping_class = "none",
centroid_type = "mean",
cluster_on_phase = FALSE
)
)
)
plot(comparison_kmeans, validation_criterion = "wss", what = "mean",lwd = 3)
plot(comparison_kmeans, validation_criterion = "wss", what = "distribution")
plot(comparison_kmeans, validation_criterion = "silhouette", what = "mean")
plot(comparison_kmeans, validation_criterion = "silhouette", what = "distribution")
Two clusters is probably best; let's take a look:
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
seeds = sample(1:n,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 943
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 2
- Initial seeds for cluster centers: 508 111
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 863
* Size of cluster #1: 80
[... iterations 2-14 omitted ...]
- Iteration #15
* Size of cluster #0: 490
* Size of cluster #1: 453
Active stopping criteria:
- Memberships did not change.
user system elapsed
1.790 0.093 1.865
autoplot(fdakma0der,type = "amplitude")
table(df_week[fdakma0der$memberships==1,2])
week_of_year
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
10 10 4 9 7 6 5 10 10 4 7 9 9 11 13 12 12 13 13 7 10 14 10 10 11 10 12 9 5 11 12 13
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
10 14 14 12 13 10 10 11 7 5 11 13 6 9 5 9 7 4 1 7 4
table(df_week[fdakma0der$memberships==2,2])
week_of_year
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
8 8 14 9 11 12 13 8 8 14 11 9 9 7 5 6 6 5 5 11 8 4 8 8 7 8 6 9 13 7 6 5
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
8 4 4 6 5 8 8 7 11 13 7 5 12 9 13 9 11 14 17 11 3
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
This split is probably holiday weeks versus working weeks.
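One way to quantify the holiday interpretation is the share of each calendar week assigned to cluster 2: holiday-heavy weeks (summer, year-end) should stand out. A sketch, assuming week_of_year is the second column of df_week:

```r
# row-wise proportions: for each week of year, the fraction in each cluster
prop2 <- prop.table(table(df_week[[2]], fdakma0der$memberships), margin = 1)[, 2]
plot(as.integer(names(prop2)), prop2, type = "h", lwd = 3,
     xlab = "week of year", ylab = "share of weeks in cluster 2")
```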
n2 <- fd_week2$N
x2 <- t(matrix(rep(days,n2),7,n2))
y2 <- as.matrix(df_week2[,3:9])
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
seeds = sample(1:n2,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 900
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 2
- Initial seeds for cluster centers: 227 688
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 103
* Size of cluster #1: 797
[... iterations 2-15 omitted ...]
- Iteration #16
* Size of cluster #0: 430
* Size of cluster #1: 470
Active stopping criteria:
- Memberships did not change.
user system elapsed
1.706 0.084 1.768
autoplot(fdakma0der,type = "amplitude")
table(df_week2[fdakma0der$memberships==1,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
8 14 9 11 12 13 8 8 14 11 9 9 7 5 6 5 5 5 11 8 4 8 8 7 8 6 9 13 7 6 5 8
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
4 4 6 5 9 8 7 11 13 7 4 12 9 13 9 11 14 17
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
10 4 9 7 6 5 10 10 4 7 9 9 11 13 12 13 13 13 7 10 14 10 10 11 10 12 9 5 11 12 13 10
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
14 14 12 13 9 10 11 7 5 11 14 6 9 5 9 7 4 1
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
Excluding the problematic weeks makes no difference.
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
seeds = sample(1:n2,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 900
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 3
- Initial seeds for cluster centers: 592 357 277
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 383
* Size of cluster #1: 156
* Size of cluster #2: 361
[... iterations 2-12 omitted ...]
- Iteration #13
* Size of cluster #0: 279
* Size of cluster #1: 260
* Size of cluster #2: 361
Active stopping criteria:
- The total dissimilarity did not decrease.
user system elapsed
1.863 0.111 1.956
autoplot(fdakma0der,type = "amplitude")
table(df_week2[fdakma0der$memberships==1,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
3 3 7 6 2 3 6 6 4 5 5 5 6 9 4 7 5 5 5 8 6 3 5 7 7 7 8 4 12 9 11 6
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
10 7 6 5 9 7 7 4 2 8 9 4 4 3 3 4 1
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
10 2 2 4 8 4 5 5 1 3 7 5 9 5 9 6 9 9 3 5 8 7 5 4 5 5 2 2 3 4 3 5
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
4 9 7 8 2 4 6 4 6 4 6 4 7 4 7 5 4 1
table(df_week2[fdakma0der$memberships==3,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 13 9 8 8 11 7 7 13 10 6 8 3 4 5 5 4 4 10 5 4 8 8 7 6 6 8 12 3 5 4 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
4 2 5 5 7 7 5 10 10 6 3 10 7 11 8 9 13 17
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
par(mfrow=c(1,3))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)
Clusters 2 and 3 are very similar;
trying k = 4:
k <- 4
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
seeds = sample(1:n2,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 900
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 4
- Initial seeds for cluster centers: 869 778 319 413
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 485
* Size of cluster #1: 42
* Size of cluster #2: 354
* Size of cluster #3: 19
[... iterations 2-18 omitted ...]
- Iteration #19
* Size of cluster #0: 273
* Size of cluster #1: 239
* Size of cluster #2: 198
* Size of cluster #3: 190
Active stopping criteria:
- The total dissimilarity did not decrease.
user system elapsed
3.143 0.170 3.291
autoplot(fdakma0der,type = "amplitude")
table(df_week2[fdakma0der$memberships==1,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
3 8 5 6 6 10 3 6 13 9 4 7 4 3 3 4 5 3 7 6 3 7 6 5 6 5 5 9 3 5 2 5
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
4 2 4 4 7 6 2 6 5 4 1 8 6 8 6 5 9 10
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
3 3 5 6 2 6 5 3 4 3 5 6 10 5 7 5 4 5 6 6 2 5 6 6 6 5 4 8 8 9 6 10
35 36 37 38 39 40 41 43 44 45 46 47 48 49 50
10 5 5 4 7 6 3 7 6 2 2 1 3 2 1
table(df_week2[fdakma0der$memberships==3,2])
week_of_year
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34
9 2 5 3 8 3 7 3 2 2 7 1 2 2 2 3 3 3 1 5 4 5 5 5 3 4 3 2 2 2 4 3
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
5 4 3 4 7 1 7 4 8 4 6 5 6 7 5 7
table(df_week2[fdakma0der$memberships==4,2])
week_of_year
2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
3 5 3 3 4 3 2 4 3 4 5 6 3 8 4 5 8 5 1 9 5 2 2 1 4 4 2 5 3 5 3 1
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
6 4 5 4 1 3 8 6 3 3 4 4 4 3 4 3 1
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
id4 <- which(fdakma0der$memberships==4)
par(mfrow=c(2,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)
plot(fd_week2[id4[1:30],],lwd = 1,main = "cluster 4",col = 5)
Two clusters still works best. Trying hierarchical clustering for comparison:
system.time(
hclustres <- fdahclust(
x = x2,
y = y2,
n_clusters = 2L,
warping_class = "none",
centroid_type = "mean",
metric = "pearson",
linkage_criterion = "complete",
cluster_on_phase = FALSE,
use_verbose = TRUE,
warping_options = c(0.15, 0.15),
maximum_number_of_iterations = 100L,
number_of_threads = 12L,
parallel_method = 0L,
distance_relative_tolerance = 0.001,
use_fence = FALSE,
check_total_dissimilarity = TRUE,
compute_overall_center = FALSE
)
)
ℹ Computing the distance matrix...
autoplot(hclustres)
diagnostic_plot(hclustres)
table(hclustres$memberships)
1 2
820 80
matplot(t(hclustres$center_curves[,1,]),type = 'l',
main='cluster center curves',xlab='day',ylab='crashes', col = 1:5,lwd = 3)
The pattern is always the same: the same division into two clusters.
Re-running k-means with k = 2 on the full data:
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
seeds = sample(1:n,k),
warping_class = "none",
metric = "pearson",
centroid_type = "mean",
distance_relative_tolerance = 1e-3,
add_silhouettes = F,
parallel_method = 0L,
number_of_threads = 12
)
)
Information about the data set:
- Number of observations: 943
- Number of dimensions: 1
- Number of points: 7
Information about cluster initialization:
- Number of clusters: 2
- Initial seeds for cluster centers: 462 379
Information about the methods used within the algorithm:
- Warping method: none
- Center method: mean
- Dissimilarity method: pearson
- Optimization method: bobyqa
Information about warping parameter bounds:
- Warping options: 0.1500 0.1500
Information about convergence criteria:
- Maximum number of iterations: 100
- Distance relative tolerance: 0.001
Information about parallelization setup:
- Number of threads: 12
- Parallel method: 0
Other information:
- Use fence to robustify: 0
- Check total dissimilarity: 1
- Compute overall center: 0
Running k-centroid algorithm:
- Iteration #1
* Size of cluster #0: 452
* Size of cluster #1: 491
[... iterations 2-17 omitted ...]
- Iteration #18
* Size of cluster #0: 488
* Size of cluster #1: 455
Active stopping criteria:
- Memberships did not change.
user system elapsed
1.754 0.098 1.848
autoplot(fdakma0der,type = "amplitude")
table(df_week[fdakma0der$memberships==1,2])
week_of_year
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
10 10 4 9 7 6 5 10 10 4 7 9 9 11 13 12 12 13 13 7 10 14 10 10 11 10 12 9 5 11 12 13
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
10 14 14 12 13 9 10 11 7 5 11 13 6 9 5 9 7 4 1 6 4
table(df_week[fdakma0der$memberships==2,2])
week_of_year
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
8 8 14 9 11 12 13 8 8 14 11 9 9 7 5 6 6 5 5 11 8 4 8 8 7 8 6 9 13 7 6 5
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
8 4 4 6 5 9 8 7 11 13 7 5 12 9 13 9 11 14 17 12 3
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
Plotting the weekly frequencies of cluster 1:
df <- data.frame(table(df_week[fdakma0der$memberships==1,2]),table(df_week[fdakma0der$memberships==2,2])) %>% select(-week_of_year.1)
df %>% ggplot(aes(as.integer(as.character(week_of_year)),Freq)) + geom_line() + xlab("week of year")
This shows that cluster 1 corresponds to the working weeks, while cluster 2 corresponds to the holidays.
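A tidier version of the same plot shows both clusters' weekly frequencies together (a sketch, assuming the tidyverse tooling already used above and that the table() names are the week numbers):

```r
t1 <- table(df_week[fdakma0der$memberships == 1, 2])
t2 <- table(df_week[fdakma0der$memberships == 2, 2])
tibble(week = as.integer(names(t1)),
       `cluster 1` = as.vector(t1),
       `cluster 2` = as.vector(t2[names(t1)])) %>%
  pivot_longer(-week, names_to = "cluster", values_to = "freq") %>%
  ggplot(aes(week, freq, colour = cluster)) +
  geom_line(lwd = 1) + xlab("week of year")
```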